Abstract

A single, stationary topic model such as la- 
tent Dirichlet allocation is inappropriate for 
modeling corpora that span long time peri- 
ods, as the popularity of topics is likely to 
change over time. A number of models that 
incorporate time have been proposed, but in 
general they either exhibit limited forms of 
temporal variation, or require computationally
expensive inference methods. In this paper
we propose nonparametric Topics over
Time (npTOT), a model for time-varying
topics that allows an unbounded number of
topics and a flexible distribution over the
temporal variations in those topics' popularity.
We develop a collapsed Gibbs sampler for the 
proposed model and compare against existing 
models on synthetic and real document sets. 



1 Introduction 



Latent variable models, such as latent Dirichlet
allocation (LDA, Blei et al., 2003) and hierarchical Dirichlet
processes (HDP, Teh et al., 2006), are popular choices
for modeling text corpora. Each document is modeled as
a distribution over a shared set of topics, which are
themselves distributions over words. Each word in a
document is assumed to be generated by one of these
topics.

Most topic models assume that the documents are ex- 
changeable, or in other words, that the order in which 
they appear is irrelevant. This is often not a reason- 
able assumption - the distribution over topics in to- 
day's newspaper is likely to be more similar to the 
distribution over topics in yesterday's newspaper than 
to the distribution over topics in a paper from a year 
ago. Similarly, popular topics on Twitter are likely to 
vary with both time and geographic location. 



A number of models have been proposed to address
this. Dependent Dirichlet processes (MacEachern,
1999) are distributions over collections of distributions,
each indexed by a location in some covariate space
(e.g. time), such that distributions that are close
together in that space tend to be similar. Various forms
of dependent Dirichlet process have been used to
construct time-dependent topic models. Many of these
models are limited in the form of variation obtained:
for example, in the models of Lin et al. (2010)
and Rao and Teh (2009), the probability of seeing a
topic as a function of time is restricted to be unimodal.
Moreover, these models are difficult to apply to higher
dimensional spaces, and often rely on the discretization
of time. More flexible models, such as those of Srebro and
Roweis (2005) and MacEachern (2000), tend to lose
the desirable conjugacy properties of the corresponding
stationary model, making inference challenging.

An alternative approach is seen in a model known as
Topics over Time (TOT, Wang and McCallum, 2006).
Unlike the previously discussed models, which define
a distribution over topics conditional on a time, TOT
models the text and the timestamp of a document
jointly. This allows us to consider the timestamp as a
random variable, rather than a fixed parameter. Such
a framework allows us to incorporate non-Markovian
dynamics while maintaining reasonable inference
requirements. It also means that we can incorporate
data without covariate information (for example,
documents with no timestamp), something that is not
easily achieved in conditional models such as dependent
Dirichlet processes.

Like the conditional models, Topics over Time suffers
from a number of shortcomings. The distribution over
times for each topic is assumed to be unimodal, while
in real life we often see topics vary in popularity in a
more flexible manner. For example, Figure 1 shows the
popularity of the search term "NHL" as a Google query.
The popularity waxes and wanes with the hockey season,
and occasionally peaks due to a major news event.




Figure 1: Search and news trends for "NHL" obtained 
from Google Trends. This shows that interest in this 
topic rises and falls multiple times. 



In addition, the number of topics must be fixed a pri- 
ori, which can involve expensive model comparison. 

In this paper, we propose a nonparametric extension to
the Topics over Time model (npTOT). This model
extends TOT to allow an unbounded number of topics,
each of which can peak in popularity an unbounded 
number of times. In addition, npTOT induces corre- 
lations between the temporal variations in topic popu- 
larity, so that related topics trend in similar manners. 
Because, like TOT, npTOT is a joint model of both 
text and time, document/timestamp pairs can be con- 
sidered exchangeable and we can make use of tractable 
exchangeable distributions to develop a Gibbs sam- 
pling scheme. We compare npTOT with its paramet- 
ric counterpart, plus several baselines, and show that 
the added flexibility translates into qualitatively and 
quantitatively better performance. 

2 Related Work

Traditional topic models such as LDA have two main
shortcomings. First, they are parametric models
that assume a fixed, prespecified number of topics
regardless of the data. Second, they assume that the
probability of seeing a topic is independent of the time
at which a document is written. In this section, we
consider existing models that address one or both of
these limitations.

2.1 Nonparametric topic models

To relax the assumption of a fixed number of topics,
nonparametric topic models have been proposed.
Rather than the fixed, finite number of topics specified
by LDA, such models allow a countably infinite number
of topics a priori, meaning that a random number
of topics will be used to represent a given dataset.
The most widely used nonparametric topic model
replaces the collection of Dirichlet distributions used to
model the per-document distributions over topics in
LDA with a hierarchical Dirichlet process (HDP, Teh
et al., 2006). Here, the distribution over topics in a
given document is given by a Dirichlet process. The
document-specific Dirichlet processes are coupled
using a shared base measure, which is itself a Dirichlet
process.

2.2 Dependent Dirichlet processes

The HDP assumes that the documents in our corpus
are exchangeable. A class of models referred to
as dependent Dirichlet processes (DDPs, MacEachern,
1999) relaxes this assumption. In topic models
based on DDPs, each document is associated with a
value in some covariate space, for example time. As
in the HDP, the topic distribution of each document
is marginally distributed according to a Dirichlet
process. Unlike the HDP, documents that are close
together in covariate space tend to have similar
distributions.

A number of DDPs have been used in topic modeling.
The recurrent Chinese restaurant process (Ahmed and
Xing, 2010) creates a Markov chain of distributions;
however, the model is non-exchangeable, so we cannot
make use of conjugacy in inferring the topic
proportions. In addition, the model is only applicable to
covariate spaces of a single dimension. A number of
related models (Caron et al., 2007; Lin et al., 2010; Rao
and Teh, 2009) maintain some of the conjugacy of the
original model, but do not allow as flexible variation
in topic probability. A number of DDPs can exhibit
more flexible, non-Markovian variation in topic
probabilities (Srebro and Roweis, 2005; MacEachern, 2000),
but inference in such models scales very poorly.


2.3 Topics over Time 

The DDP models mentioned in the previous section
are examples of conditional models: the covariate is
assumed fixed, and the model defines a distribution
over topics conditioned on this covariate value. The
Topics over Time (TOT, Wang and McCallum, 2006)
model takes a different tack, assuming that the covariate
values are also random, and that the latent topics
describe a distribution both over words and over times.
This model is exchangeable if we consider a datapoint
to consist of both a document's text and its timestamp,
meaning we can make use of conjugacy.



TOT is a form of supervised LDA (Blei and McAuliffe,
2007), where the label is the timestamp of the
document. TOT assumes the following generative process
for a corpus of documents and their associated
timestamps:

1. For each topic $k = 1, \ldots, K$

(a) Sample a distribution over words $\phi_k \sim \mathrm{Dirichlet}(\beta)$.

(b) Choose a set of parameters $\psi_k$ to parametrize
a beta distribution.

2. For each document $j = 1, 2, \ldots, D$

(a) Sample a distribution over topics, $\theta_j \mid \alpha \sim \mathrm{Dir}(\alpha)$.

(b) For each word $i = 1, \ldots, N_j$

i. Sample a topic indicator $z_{ji} \mid \theta_j \sim \theta_j$.

ii. Sample a word $w_{ji} \mid z_{ji} \sim \mathrm{Mult}(\phi_{z_{ji}})$.

iii. Sample a timestamp $t_{ji} \mid z_{ji} \sim \mathrm{Beta}(\psi_{z_{ji}})$.
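
The generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the default hyperparameter values, and in particular the uniform draw used to pick each topic's beta parameters are hypothetical (TOT itself places no prior on those parameters).

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_tot_corpus(num_docs, num_topics, vocab_size, doc_len,
                        alpha=0.5, beta=0.5, seed=0):
    """Sample (topic, word, timestamp) tokens following the TOT steps."""
    rng = random.Random(seed)
    # Step 1(a): one distribution over words per topic.
    phi = [sample_dirichlet([beta] * vocab_size, rng)
           for _ in range(num_topics)]
    # Step 1(b): beta parameters per topic. The uniform draw is an
    # arbitrary stand-in chosen for illustration only.
    psi = [(rng.uniform(1.0, 5.0), rng.uniform(1.0, 5.0))
           for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Step 2(a): per-document distribution over topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        tokens = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]   # 2(b)i
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # 2(b)ii
            t = rng.betavariate(*psi[z])                           # 2(b)iii
            tokens.append((z, w, t))
        corpus.append(tokens)
    return corpus
```

Because every timestamp is a beta draw, all sampled times fall in the unit interval, which makes the bounded-time restriction of the model concrete.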

This model exhibits non-Markovian variations in topic
probabilities, but has a number of drawbacks. The
beta distribution used to model the time-varying
probability is unimodal, and requires that times be
bounded. This limits the form of temporal variation
available, and precludes prediction outside of the
bounded time-frame or extension to higher
dimensionalities. Moreover, the lack of a prior on $\psi_k$ means it
must be estimated using an approximate method. In
addition, the number of topics must be defined a priori.

3 Nonparametric Topics over Time (npTOT)

In this section we address the two problems identified
in TOT: inflexible topic probability variation, and a
fixed number of topics. The resulting model employs
nonparametric distributions to generate both the dis- 
tribution over topics, and the distribution over times- 
tamps; therefore, we refer to this model as nonpara- 
metric Topics over Time (npTOT). 

We follow TOT in assuming that each document
(indexed by $j$) consists of an unordered set of tokens
(indexed by $i$). Each token $x_{ji} := (w_{ji}, t_{ji})$ is defined
to be an ordered pair of a word $w_{ji}$ and a timestamp
$t_{ji}$.¹ We assume that each document is generated by
a distribution over multiple topics.

The restriction to a fixed number of topics can be 
avoided by replacing the Dirichlet distribution over 
topics with a hierarchical Dirichlet process. This al- 
lows an unbounded number of topics a posteriori, and 
ensures the topics are shared across documents. 



¹ In practice, a document has a single timestamp, which
we duplicate for each word during inference.



The form of temporal variation can be modified by
replacing the beta distribution in the TOT model with
another choice of distribution. In order to model
multimodal variation on an unbounded timeframe, while
maintaining tractable inference, we choose to use a
mixture of Gaussians. To allow a flexible distribution
over time, we use a Dirichlet process as the mixing
measure.

We note that there may be correlations between the 
trending patterns of topics, something that is not
addressed in much of the dynamic topic modeling
literature. For example, topics concerning sports players
and sports fans are likely to have similar temporal
variation. We address this by allowing the components of
our mixture of Gaussians to be shared between topics. 
This is achieved by sampling the mixture components 
from a hierarchical Dirichlet process. 

Let $G$ be a normal-inverse-gamma distribution over
the mean and variance of our time components, and
assume $\gamma$, $\alpha_0$, $\alpha_1$ and $\lambda$ to be fixed hyperparameters.
Let GEM indicate the distribution over probability
measures associated with the Dirichlet process. The
generative process, represented by the plate diagram in
Figure 2, is defined as follows:

1. Sample a global base distribution over topic
proportions, $J_0 \mid \gamma \sim \mathrm{GEM}(\gamma)$.

2. Sample a global base measure over the
means and variances of the time components,
$L_0 \mid \lambda, G \sim \mathrm{DP}(\lambda, G)$, such that each component
parametrizes a univariate Gaussian.

3. For each topic $k = 1, 2, \ldots$

(a) Sample a distribution over words, $\phi_k \sim \mathrm{Dirichlet}(\beta)$.

(b) Sample a topic-specific distribution over time
components, $L_k \mid \alpha_1, L_0 \sim \mathrm{DP}(\alpha_1, L_0)$.

4. For each document $j = 1, 2, \ldots, D$

(a) Sample a distribution over topics, $J_j \mid \alpha_0, J_0 \sim \mathrm{DP}(\alpha_0, J_0)$.

(b) For each word $i = 1, \ldots, N_j$

i. Sample a topic indicator $z_{ji} \mid J_j \sim J_j$.

ii. Sample a word $w_{ji} \mid \phi_{z_{ji}} \sim \mathrm{Mult}(\phi_{z_{ji}})$.

iii. Sample a time component $\omega_{ji} := (\mu_{ji}, \sigma_{ji}) \mid L_{z_{ji}} \sim L_{z_{ji}}$.

iv. Sample a timestamp $t_{ji} \mid \mu_{ji}, \sigma_{ji} \sim \mathcal{N}(\mu_{ji}, \sigma_{ji})$.
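
The generative process above can be simulated approximately by truncating each Dirichlet process to a finite number of atoms. The sketch below is a minimal illustration under that assumption, not the inference procedure of this paper: the truncation levels `max_topics` and `max_components`, the hyperparameter defaults, the particular normal-inverse-gamma settings, and all function names are hypothetical. It relies on the fact that a Dirichlet process whose base measure has finitely many atoms has Dirichlet-distributed weights on those atoms.

```python
import math
import random

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights, approximating GEM(concentration)."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, concentration)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # leftover mass goes to the last atom
    return weights

def dirichlet(params, rng):
    """Dirichlet draw via normalized gamma variates.
    Small floors guard against zero parameters and a zero total."""
    draws = [rng.gammavariate(max(p, 1e-6), 1.0) + 1e-300 for p in params]
    total = sum(draws)
    return [d / total for d in draws]

def generate_nptot(num_docs, doc_len, vocab_size, gamma=1.0, lam=1.0,
                   alpha0=1.0, alpha1=1.0, beta=0.5,
                   max_topics=10, max_components=8, seed=0):
    rng = random.Random(seed)
    j0 = stick_breaking(gamma, max_topics, rng)      # step 1: J_0
    l0 = stick_breaking(lam, max_components, rng)    # step 2: weights of L_0
    atoms = []                                       # step 2: atoms drawn from G
    for _ in range(max_components):
        variance = 1.0 / rng.gammavariate(3.0, 1.0)  # inverse-gamma variance
        mean = rng.gauss(0.0, math.sqrt(variance))   # mean given the variance
        atoms.append((mean, variance))
    # Step 3(a): word distribution per topic.
    phi = [dirichlet([beta] * vocab_size, rng) for _ in range(max_topics)]
    # Step 3(b): DP(alpha1, L_0) over a finite L_0 is Dirichlet(alpha1 * L_0).
    l_k = [dirichlet([alpha1 * w for w in l0], rng) for _ in range(max_topics)]
    corpus = []
    for _ in range(num_docs):
        j_j = dirichlet([alpha0 * w for w in j0], rng)                 # 4(a)
        tokens = []
        for _ in range(doc_len):
            z = rng.choices(range(max_topics), weights=j_j)[0]         # 4(b)i
            w = rng.choices(range(vocab_size), weights=phi[z])[0]      # 4(b)ii
            c = rng.choices(range(max_components), weights=l_k[z])[0]  # 4(b)iii
            mean, variance = atoms[c]
            t = rng.gauss(mean, math.sqrt(variance))                   # 4(b)iv
            tokens.append((z, w, t))
        corpus.append(tokens)
    return corpus
```

Since every topic draws its time-component weights from the same base measure, two topics can place mass on the same Gaussian, which is how shared temporal trends arise in this construction.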



4 Inference 



We propose a Gibbs sampler based on the Chinese
restaurant franchise (CRF, Teh et al., 2006). Our
model requires two restaurant franchises, one for the
word HDP and the other for the time HDP.

Figure 2: Plate diagram for nonparametric Topics over
Time

The CRF associated with the word HDP mimics that
described by Teh et al. (2006): each restaurant
corresponds to a document and each dish corresponds to a
topic. The CRF associated with the time HDP has a
different interpretation: the restaurants correspond to
the topics, and the dishes correspond to "time
components", which are associated with a Gaussian
distribution over time. We use terms such as "time table", "time
dish", "word table", and "word dish" to distinguish
between the two franchises.



Each token $x_{ji} = (w_{ji}, t_{ji})$ is associated with a word
table $\tau^{\text{word}}_{ji}$ and a time table $\tau^{\text{time}}_{ji}$. Each word table $a$
in document $j$ is associated with a word dish (topic)
$d^{\text{word}}_{ja}$. Each time table $b$ in a topic $k$ is associated
with a time dish (time component) $d^{\text{time}}_{kb}$. We define
$n_{ja}$ as the number of tokens in document $j$ associated
with word table $a$; $q_{kb}$ as the number of tokens
associated with topic $k$ and time table $b$; and $f(v,k)$ as the
number of times the word $v$ is associated with topic $k$.

At each iteration of our Gibbs sampler, we need to
sample, for each token $i$ in document $j$, both the
corresponding word table $\tau^{\text{word}}_{ji}$ and time table $\tau^{\text{time}}_{ji}$. We
also need to sample the topic $d^{\text{word}}_{ja}$ corresponding to
each word table $a$ in document $j$ and the time
component $d^{\text{time}}_{kb}$ corresponding to each time table $b$ in topic
$k$. We describe these steps in detail in the remainder
of this section.



4.1 Sampling $\tau^{\text{word}}_{ji}$



Recall that each word table is associated not just with
a distribution over words, but also with a
distribution over time tables. If we were to sample the word
table for a token conditioned on that token's time
table, our sampler would mix very slowly. Instead, we
marginalize over $\tau^{\text{time}}_{ji}$ in order to sample $\tau^{\text{word}}_{ji}$, and
then sample $\tau^{\text{time}}_{ji}$ conditioned on $\tau^{\text{word}}_{ji}$ as described in
Section 4.3.



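The resulting distribution over word tables is given by

$$p(\tau^{\text{word}}_{ji} = a \mid x, t, \text{rest}_{-ji}) \propto p(\tau^{\text{word}}_{ji} = a \mid \tau^{\text{word}}_{-ji})\; p(w_{ji}, t_{ji} \mid \tau^{\text{word}}_{ji} = a, x_{-ji}, t_{-ji}, \text{rest}_{-ji}) \qquad (1)$$

where $\text{rest}_{-ji} = \{\tau^{\text{word}}_{-ji}, \tau^{\text{time}}_{-ji}, d^{\text{word}}, d^{\text{time}}\}$. The
component terms of Equation 1 are given by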

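$$p(\tau^{\text{word}}_{ji} = a \mid \tau^{\text{word}}_{-ji}) \propto \begin{cases} n^{-ji}_{ja} & \text{if $a$ is an existing table,} \\ \alpha_0 & \text{otherwise,} \end{cases} \qquad (2)$$

where the superscript $-ji$ indicates that token $ji$ is excluded from the count, and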



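$$p(w_{ji}, t_{ji} \mid \tau^{\text{word}}_{ji} = a, x_{-ji}, t_{-ji}, \text{rest}_{-ji}) = \frac{\beta + f(w_{ji}, k)}{V\beta + \sum_{v=1}^{V} f(v, k)}\; p(t_{ji} \mid \tau^{\text{word}}_{ji} = a, d^{\text{word}}_{ja} = k, t_{-ji}, \text{rest}_{-ji})$$

if $a$ is an existing table serving dish $k$, or

$$p(w_{ji}, t_{ji} \mid \tau^{\text{word}}_{ji} = a, x_{-ji}, t_{-ji}, \text{rest}_{-ji}) = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma} \cdot \frac{\beta + f(w_{ji}, k)}{V\beta + \sum_{v=1}^{V} f(v, k)}\; p(t_{ji} \mid \tau^{\text{word}}_{ji} = a, d^{\text{word}}_{ja} = k, t_{-ji}, \text{rest}_{-ji}) + \frac{\gamma}{m_{\cdot\cdot} + \gamma} \cdot \frac{1}{V}\; p(t_{ji} \mid \tau^{\text{word}}_{ji} = a, d^{\text{word}}_{ja} = k^{\text{new}}, t_{-ji}, \text{rest}_{-ji})$$

where, as in Teh et al. (2006), $m_{\cdot k}$ is the number of word tables serving topic $k$ and $m_{\cdot\cdot}$ is the total number of word tables,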

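if $a$ is a new table. In both cases, we can write

$$p(t_{ji} \mid \tau^{\text{word}}_{ji} = a, d^{\text{word}}_{ja} = k, t_{-ji}, \text{rest}_{-ji}) = \sum_{b} \frac{q^{-ji}_{kb}}{q^{-ji}_{k\cdot} + \alpha_1}\, g^{-ji}_{d^{\text{time}}_{kb}}(t_{ji}) + \frac{\alpha_1}{q^{-ji}_{k\cdot} + \alpha_1} \left[ \sum_{c=1}^{C} \frac{r_{\cdot c}}{r_{\cdot\cdot} + \lambda}\, g^{-ji}_{c}(t_{ji}) + \frac{\lambda}{r_{\cdot\cdot} + \lambda}\, g_{\text{new}}(t_{ji}) \right]$$

where $r_{\cdot c}$ is the number of time tables serving time component $c$,
$r_{\cdot\cdot}$ is the total number of time tables, $C$ is the number of time
components currently in use, $g_{\text{new}}$ is the prior predictive
distribution under $G$, and $g^{-ji}_{c}$ denotes the posterior predictive time
distribution for token $x_{ji}$ conditioned on a time
component $c$ and other timestamps associated with that time
component. For the Gaussian model described here,
the posterior predictive distribution is a t-distribution.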
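
To make the t-distributed predictive concrete, the sketch below computes the posterior predictive log density of a new timestamp under a Gaussian likelihood with a normal-inverse-gamma prior, using the standard conjugate update. The hyperparameter names `m0`, `kappa0`, `a0`, `b0` and their default values are assumptions for illustration; the source does not state the parametrization or values used.

```python
import math

def nig_posterior(times, m0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Conjugate update of normal-inverse-gamma parameters given timestamps."""
    n = len(times)
    if n == 0:
        return m0, kappa0, a0, b0
    mean = sum(times) / n
    sum_sq = sum((t - mean) ** 2 for t in times)
    kappa_n = kappa0 + n
    m_n = (kappa0 * m0 + n * mean) / kappa_n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * sum_sq + kappa0 * n * (mean - m0) ** 2 / (2.0 * kappa_n)
    return m_n, kappa_n, a_n, b_n

def predictive_log_density(t_new, times, **prior):
    """Log density of the Student-t posterior predictive at t_new."""
    m_n, kappa_n, a_n, b_n = nig_posterior(times, **prior)
    nu = 2.0 * a_n                                      # degrees of freedom
    scale_sq = b_n * (kappa_n + 1.0) / (a_n * kappa_n)  # squared scale
    z = (t_new - m_n) ** 2 / (nu * scale_sq)
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi * scale_sq)
            - (nu + 1.0) / 2.0 * math.log1p(z))
```

Calling `predictive_log_density(t, [])` with an empty list gives the prior predictive, which is the quantity needed when a token is assigned to a brand-new time component.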



If a new word table is created for token $x_{ji}$, then we
sample its corresponding word dish (topic) from the
global word DP.



4.2 Sampling $d^{\text{word}}_{ja}$



In order to resample the topic assignment $d^{\text{word}}_{ja}$ for
an entire word table, we need to marginalize over
the time table assignments of all the tokens (denoted
$(w_{ja}, t_{ja})$) associated with that word table.

Since the number of tokens at the word table might be
large, summing over all possible assignments is
infeasible, so we approximate $p(t_{ja} \mid d^{\text{word}}_{ja} = k, t_{-ja}, \text{rest}_{-ja})$
by sampling sets of time table assignments. We use
the resulting estimate $\hat{p}(t_{ja})$ to approximate the true
Gibbs sampling probabilities:



$$p(d^{\text{word}}_{ja} = k \mid d^{\text{word}}_{-ja}, w, \text{rest}_{-ja}) \propto \begin{cases} m^{-ja}_{\cdot k}\; p(w_{ja} \mid d^{\text{word}}_{ja} = k, w_{-ja})\; \hat{p}(t_{ja}) & \text{if $k$ is an existing topic,} \\ \gamma\; p(w_{ja} \mid d^{\text{word}}_{ja} = k, w_{-ja})\; \hat{p}(t_{ja}) & \text{otherwise.} \end{cases}$$

4.3 Sampling $\tau^{\text{time}}_{ji}$ and $d^{\text{time}}_{kb}$

Given the topic assignments, the distribution over the
timestamps is independent of the rest of the model,
and we can perform Gibbs sampling as in Teh et al.
(2006).

5 Evaluation

The goal of this paper was to increase the flexibility
of TOT, an existing joint model for documents and
their timestamps, by allowing multimodal variation in
topic popularity, and by learning the number of topics.
In this section, we present experimental results that
demonstrate that we can capture more flexible
variation than TOT, and learn an appropriate number of
topic components. Moreover, we show that this added
flexibility translates into improved log likelihood on
test datasets.

5.1 Evaluation on Synthetic Data

To demonstrate the ability of npTOT to recover the
temporal variation of topics, we trained the model
on a synthetic dataset, where ground truth is
available. We generated a dataset of $D$ documents from
$K$ topics, each associated with a multinomial
distribution over $V$ words obtained by discretizing
Gaussian distributions with means sampled uniformly on
$[0, V]$. Each topic is also associated with a
continuous distribution over time, distributed according to
a mixture of $C$ Gaussians with means at $0.5 + k/K$,
$k = 1, \ldots, K$. Each component has equal variance $\sigma$
such that $3 \cdot C \cdot \sqrt{\sigma} = 1$. For each timestamp we
generate one document. Each document is associated
with a distribution over topics which is proportional to
the probability of that document generating the
timestamp. Topics and words were sampled according to
the LDA generative procedure. We set $D$ to 100, $K$
to 30, $V$ to 100 and $C$ to 10. An example of a single
topic and the corresponding distribution over times is
shown in Figure 3.

We trained TOT and npTOT on the generated data.
The number of topics in TOT was set to the true
number of topics, and as we see in Figure 3(b,d,e), the
distributions over words obtained were a good match
for the generating data. The npTOT model found 27
topics, very close to the true value. As we see in
Figure 3(c), TOT was unable to capture the variation
in topics. Conversely, npTOT was able to capture
the multimodality of their distribution with respect
to time (Figure 3(e)).



5.2 Real-world Data Experiments

To show that npTOT is able to capture the temporal 
variation in real documents, we performed experiments 
on three datasets: 

• Twitter Subset. This dataset consists of tweets
originating from Egypt in the time period from
January through March 2011. We selected tweets
from active users, where an active user is a user
who has more than 200 tweets. As preprocessing,
we removed words that are shorter than 3
characters. We then removed the 40 most frequent
words as well as words that occurred fewer than 10
times. Finally, we aggregated the tweets of each
user on each day into a single document and removed
documents shorter than 20 words. The
preprocessed dataset contains 6,072 documents,
9,080 unique words and 324,298 word tokens in
total. Because these tweets originated from Egypt,
they contain both Arabic and English words.

• State of the Union Address dataset.² The
State of the Union dataset contains the
transcripts of 208 State of the Union addresses from
1790 to 2002. We followed Wang and McCallum
(2006) in processing the dataset. Namely,
we divided each speech into three-paragraph
documents, and removed stop words and numbers.
This resulted in 5,897 documents, 22,620 unique
words and 800,399 word tokens in total.

• NIPS dataset.³ The NIPS data set consists
of the full text of the 12 years of proceedings
from 1987 to 1999 Neural Information Processing
Systems (NIPS) Conferences. The dataset is
already preprocessed as described in Globerson
et al. (2007) and it consists of 1,740 research
papers, 13,946 unique words and 2,301,375 word
tokens in total.

² http://www.gutenberg.org/dirs/text04/sualll1.txt
³ http://cs.nyu.edu/roweis/data.html

Figure 3: (a) shows the actual distribution over time for a particular topic on the synthetic data set, (b) shows
the distribution over words for that particular topic, (c) shows the distribution over time of the TOT-detected
topic closest to the original, (d) shows the probability of the top-10 words for the TOT topic in (c), (e) shows
the distribution over time for the corresponding topic found by npTOT and (f) shows the word distribution for the
npTOT topic in (e).
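
The filtering steps described for the Twitter subset can be sketched as follows. This is an illustrative reading, not the authors' script: the function name and argument names are hypothetical, the input is assumed to be already tokenized and aggregated into one token list per user per day, and word frequencies are assumed to be counted after the short words have been dropped (the text does not say which).

```python
from collections import Counter

def preprocess(docs, min_word_len=3, num_top_removed=40,
               min_count=10, min_doc_len=20):
    """Filter a list of token lists following the Twitter preprocessing steps."""
    # Remove words shorter than min_word_len characters.
    docs = [[w for w in doc if len(w) >= min_word_len] for doc in docs]
    # Count the remaining words across the whole collection.
    counts = Counter(w for doc in docs for w in doc)
    # Remove the most frequent words and the rare words.
    top = {w for w, _ in counts.most_common(num_top_removed)}
    keep = {w for w, c in counts.items() if c >= min_count and w not in top}
    docs = [[w for w in doc if w in keep] for doc in docs]
    # Remove documents that are too short after filtering.
    return [doc for doc in docs if len(doc) >= min_doc_len]
```

The defaults mirror the thresholds quoted in the text (3 characters, 40 words, 10 occurrences, 20 words); smaller values are convenient for trying the function on toy input.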



5.3 Real-world Data Experiments

To show that npTOT is able to capture the temporal 
variation in these datasets, we performed a qualita- 
tive analysis of the topics found, and a quantitative 
analysis of the predictive performance of the models. 



In addition to TOT, which we will refer to as TOT- 
Unimodal in this section, we evaluated against three 
other baselines: 



• LDA-Unimodal. Here, we ran LDA on the text
of the documents, and then fit the temporal
variation of each topic with a single Gaussian
distribution.

• LDA-Multimodal. Here, we ran LDA on the text
of the documents, and then fit the temporal
variation of each topic with a mixture of Gaussians.

• TOT-Multimodal. Here, we restricted npTOT
to have a fixed number of topics, in order to
disambiguate the effect of an unbounded number of
topics from the effect of using a more flexible
distribution over time.
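
As an illustration of the post-hoc fitting used by the LDA-Unimodal baseline, the sketch below fits one Gaussian per topic by maximum likelihood. How the fit was actually performed is not specified in the text, so this is an assumption: the input is taken to be (topic, timestamp) pairs, one per token, obtained from the LDA topic assignments, and the function name is hypothetical.

```python
from collections import defaultdict
import statistics

def fit_topic_gaussians(assignments):
    """Fit a Gaussian (mean, variance) to the timestamps of each topic.

    assignments: iterable of (topic_id, timestamp) pairs, one per token.
    """
    by_topic = defaultdict(list)
    for topic, timestamp in assignments:
        by_topic[topic].append(timestamp)
    # Maximum-likelihood estimates: sample mean and population variance.
    return {topic: (statistics.fmean(ts), statistics.pvariance(ts))
            for topic, ts in by_topic.items()}
```

The multimodal variant would replace the single Gaussian with a mixture fitted to the same per-topic timestamps.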



5.3.1 Qualitative analysis 

To see how npTOT can capture a wider variety of
temporal variation than TOT, consider topics found using
both models. Figure 4 shows topics found in the
Twitter and State of the Union datasets. We
hand-picked topics that addressed the same themes for the
purpose of this comparison. On the Twitter dataset,
we see a topic that arises with the outbreak of
revolution in Egypt on January 25, 2011. Both models
capture a sharp peak in this topic at that time, but the
slow decay shown by the npTOT model is more
realistic than the sharp decline in interest implied by the
TOT model. On a subset of the State of the Union
dataset, we show a topic concerned with conflict
involving the US and Britain. Both models show a sharp
peak in this topic around the time of the War of 1812,
but the nonparametric model is able to reuse this topic
to describe tensions between the US and Britain
leading to the declaration of the war, such as the Embargo
Act of 1807.




Figure 4: Left: Top eight most probable words in topics from TOT and npTOT corresponding to the Egyptian 
revolution (started on Jan 25) in Twitter dataset. Right: Top ten most probable words in topics from TOT 
and npTOT corresponding to conflicts involving the US and Britain in the State of the Union address dataset. 



Figure 5 shows how related topics can share time
components to give similar temporal variation. Since the
Twitter data is bilingual, we expect pairs of topics that
address similar issues but in different languages. The
figure shows two such pairs, demonstrating that they
share the same time components.

5.3.2 Quantitative analysis 

We evaluated the performance of npTOT and its
competitors using two methods: joint likelihood of a
document and its timestamp, and perplexity of the second
half of a document, conditioned on its timestamp and
the first half of the text. The joint likelihood gives a
general measure of how well the various methods are
able to model the corpora. The perplexity task
demonstrates how well we are able to make use of temporal
information to predict the content of a document.

In each case, we randomly split each data set into
training and test sets using a 70:30 split, and learned
all models on the training set. The LDA
models were run for 1000 iterations, and npTOT and
TOT were run until the percentage of changed
tokens was below 5%. The joint log likelihood was
obtained using the harmonic mean method, as described
in Wallach et al. (2009), by sampling topic assignments
$z_d^{(s)} \mid \Phi, w_d, t_d$ (where $\Phi$ denotes the estimated
model parameters) and taking the harmonic mean of
the conditional likelihoods $P(w_d, t_d \mid z_d^{(s)}, \Phi)$ over 200
samples.
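
The harmonic mean of a set of likelihoods is best computed in log space, since the raw likelihoods of whole documents underflow. A minimal sketch, assuming the per-sample conditional log likelihoods have already been computed and are passed in as a list (the function name is hypothetical):

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logarithms.

    Uses log HM = log S - logsumexp(-log_likelihoods), where S is the
    number of samples, so that no likelihood is ever exponentiated directly.
    """
    negated = [-ll for ll in log_likelihoods]
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in negated))
    return math.log(len(negated)) - log_sum
```

For example, likelihoods 1 and 3 have harmonic mean 2 / (1 + 1/3) = 1.5, so the function returns log 1.5 when given their logarithms.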

Figure 6: Average per-token negative log likelihood on
test set for Twitter dataset

Figure 7: Average per-token negative log likelihood on
test set for State-of-the-union-address data set

We evaluated perplexity using the estimated $\theta$ method
described in Wallach et al. (2009): for each test
document $d$ we sample topic assignments
$z_d^{(s)} \sim z_d \mid \Phi, w_d^{(1)}, t_d$, where $\Phi$ denotes the estimated model
parameters, and $t_d$ and $w_d^{(1)}$ denote the timestamp of
document $d$ and the words in the first half of the
document respectively. For each sample $z_d^{(s)}$ we
estimate $\hat{\theta}^{(s)}_{dk} = P(k \mid z_d^{(s)}, \alpha)$, which we use to
estimate the likelihood of the second half of the
document, $w_d^{(2)}$, as $\prod_{n} \sum_{k} P(w^{(2)}_{dn} \mid k, \Phi)\, \hat{\theta}^{(s)}_{dk}$. Taking the product over all documents
and then averaging over samples of $z^{(s)}$ gives an
estimate of the document completion likelihood.
The perplexity score we report is evaluated as
exp(-completion log likelihood / N), where N is the
total number of words in the test set.
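
The document completion computation can be sketched as below. The data layout is an assumption made for illustration: `second_halves[d]` holds the word ids of the held-out half of document `d`, `theta_samples[s][d]` holds the estimated topic proportions of document `d` under sample `s`, and `phi[k][w]` is the probability of word `w` under topic `k`. The function names are hypothetical.

```python
import math

def completion_log_likelihood(second_halves, theta_samples, phi):
    """Log of the sample-averaged likelihood of all held-out second halves."""
    per_sample = []
    for thetas in theta_samples:
        log_lik = 0.0
        # Product over documents and tokens, accumulated in log space.
        for words, theta in zip(second_halves, thetas):
            for w in words:
                log_lik += math.log(
                    sum(t * phi_k[w] for t, phi_k in zip(theta, phi)))
        per_sample.append(log_lik)
    # Average the likelihoods over samples using a stable log-mean-exp.
    largest = max(per_sample)
    mean = sum(math.exp(v - largest) for v in per_sample) / len(per_sample)
    return largest + math.log(mean)

def perplexity(completion_log_lik, num_tokens):
    """exp(-completion log likelihood / N) for N held-out tokens."""
    return math.exp(-completion_log_lik / num_tokens)
```

With two equally likely words under every topic, each held-out token has probability 0.5 and the perplexity is 2, which gives a quick sanity check of the two functions.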

Figures 6, 7 and 8 show the resulting log joint
likelihoods on the three data sets. In each case,
npTOT gives the best likelihood, and the baseline
LDA-Unimodal and LDA-Multimodal models perform
poorly. The two TOT models, TOT-Multimodal and
TOT-Unimodal, perform comparably, and approach
the performance of npTOT as the number of topics
reaches that found by npTOT. This is not surprising;
a parametric model with the "right" number of topics
should perform as well as a nonparametric model. The
advantage of a nonparametric model such as npTOT
is that we do not need to specify the number of topics
a priori, or perform expensive model comparisons, to
obtain good results.

Figure 5: Distributions over time for two discovered English topics (left) and their Arabic counterparts (right),
showing that they share time components. The top topic is about the Egyptian revolution outbreak. The bottom
topic is about a referendum on constitutional amendments. Each panel shows words selected from the top twelve
most probable words in the corresponding topic. Words in parentheses are translated from Arabic.



Figures 9, 10 and 11 show the perplexity obtained
through document completion. Again, we find that 
npTOT obtains lower perplexity, indicating that it is 
better able to predict held-out text. In particular, 
note that the LDA models learned without an explicit 
model of time perform very poorly. As expected, hav- 
ing information about when a document is written, 
and having a model sophisticated enough to make use 
of this information, allows us to make better guesses 
about the content of that document. 

6 Conclusions and Future Work 

The goal of this paper was to develop a flexible model
for capturing time-varying topics in text corpora where
the total number of topics is not known a priori. By
extending the TOT model to incorporate nonparametric
distributions over both words and timestamps, we have
presented a model that is able to find interpretable
topics and achieve good predictive performance on
held-out data.
Figure 8: Average per-token negative log likelihood on
test set for NIPS dataset.

Figure 9: Perplexity on test set for Twitter dataset

Figure 10: Perplexity on test set for State-of-the-
union-address data set

Figure 11: Perplexity on test set for NIPS dataset.
(Horizontal axis: number of topics. Curves: npTOT,
TOT-Multimodal, TOT-Unimodal, LDA-Multimodal, LDA-Unimodal.)



One advantage of npTOT
is that it can easily be extended to higher dimensional
covariate values. This would enable us to model
geographical variations in topic popularity. In addition
to modeling documents, topic models have been used
to model images (Fei-Fei and Perona, 2005). This is
another area where spatially dependent topic models,
based on npTOT, could be employed.

While an infinite mixture of Gaussians is flexible and 
tractable, other distributions may be more applica- 
ble in certain situations. For example, when modeling 
news articles, a topic often peaks suddenly and then 
dies down gradually. An interesting future approach 
would be to use asymmetric distributions, such as ex- 
ponential distributions, to capture this effect. 